Working with data

Here are a few tools you can use when you are working with data!

A few shoutouts:

  • R 4 Data Science: This is the ultimate guide. Ready to pick up and learn.
  • Stat 545: Ultimate guide #2. Not as ready to pick up and learn but very useful.

A few first steps

  1. Download R
  2. Download RStudio

R notebooks vs. R markdown vs. R script

R Notebooks and R Markdown documents are great ways to annotate and write as you code… This document is the perfect example! We can write text to produce a document and put code in chunks like the one below:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(gapminder)

R scripts are the original way of coding in R. They don’t have the nice functionality of writing notes while you code (beyond # comments)… though if you don’t need notes/writing, just throw all your code in an R script and you can run the whole thing in a single go pretty easily.
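To make "run the whole thing in a single go" concrete, here is a minimal sketch using base R's source(). The two-line script, the temporary file, and the variable names are all made up for illustration; in practice you would point source() at your own .R file.

```r
# Hypothetical two-line script, written to a temporary file for illustration.
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "x <- c(2, 4, 6)",
  "script_mean <- mean(x)"
), script_path)

# source() runs every line of the script from top to bottom.
# Passing an environment to `local` keeps the script's variables
# out of your global workspace.
env <- new.env()
source(script_path, local = env)

env$script_mean
```

In RStudio, the Source button at the top of the script editor does the same job.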

Load packages

Wait! What are packages?

An R package is a unit of shareable code. * more info here

This means people can download an R package and use the code that is written in it. Of course, you could build all your own functions if you wanted to, but if others have already done so… and they’ve been peer reviewed and tried and tested… why not use those?

Don’t reinvent the wheel!

The most important packages you will ever use

  • Tidyverse
    • a bunch of packages built into one cool super R package

Honestly… this is all you need. But if you want more…

  • DT
  • epitools
  • all markdown associated packages

There are tons…

OK, now let’s load packages

Install the package first… you only have to do this once per R installation. Then use the library() function to load the code in the package each time you start a new session.

# install.packages("tidyverse")
library(tidyverse)
# install.packages("gapminder")
library(gapminder)
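If you would rather not remember which packages you have already installed, one option is to check first. This is only a sketch: missing_packages() is a hypothetical helper written for this example, not a function from any package. It relies on base R's requireNamespace(), which returns FALSE when a package is not installed.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "gapminder"))
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```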

Gapminder on the Fly

What is Gapminder?

Free open dataset from Gapminder (an organization). * more info here

You need to install the gapminder package and load it before you can see the data.

FYI: other open datasets here.

Loading your data

There are a few tools for you to use…

Gapminder is already loaded in through the package… so we’re all good here.

But if you want to import your own data… you can use read_csv(). If you put a question mark in front of it and run it… well, you’ll get all the info you need. You can also use the Import Dataset button in the Environment pane at the top right of RStudio… you’ll get tons of options there.

?read_csv
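Here is a small sketch of read_csv() in action. The file and its contents are made up for illustration (a temporary CSV with two rows); swap in the path to your own file. Spelling out col_types is optional, but it makes the variable types explicit and keeps read_csv() from printing its guesses.

```r
library(readr)

# Hypothetical data, written to a temporary CSV file for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "country,year,lifeExp",
  "Country A,2007,70.5",
  "Country B,2007,65.2"
), csv_path)

# Read the file, stating the type of each column up front.
my_data <- read_csv(
  csv_path,
  col_types = cols(
    country = col_character(),
    year    = col_integer(),
    lifeExp = col_double()
  )
)

my_data
```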

Viewing your dataset

First thing we want to do is get familiar with our data…

  • What does my data look like - long? wide?
  • How many records?
  • What are my variables?
  • What type of variables are they?
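Each of the questions above maps onto a one-liner. This is just one way to do it, using base R functions on the gapminder data; the variable names (n_records and so on) are made up for this example.

```r
library(gapminder)

n_records   <- nrow(gapminder)           # how many records?
n_variables <- ncol(gapminder)           # how many variables?
var_names   <- names(gapminder)          # what are my variables?
var_types   <- sapply(gapminder, class)  # what type of variables are they?

# Long or wide? If no country-year pair is repeated, there is one row per
# country per year, which is a long layout.
one_row_per_country_year <- anyDuplicated(gapminder[, c("country", "year")]) == 0

n_records
n_variables
var_types
one_row_per_country_year
```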

View entire dataset

view(gapminder)

View summaries of dataset

str(gapminder)
## tibble[,6] [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

Use the head() function to view the first 6 rows

head(gapminder)
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

You can make this a whole lot prettier with the kable() and kable_styling() functions from the knitr and kableExtra packages

library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
gapminder %>% 
  head() %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))
country      continent  year  lifeExp  pop       gdpPercap
Afghanistan  Asia       1952  28.801   8425333   779.4453
Afghanistan  Asia       1957  30.332   9240934   820.8530
Afghanistan  Asia       1962  31.997   10267083  853.1007
Afghanistan  Asia       1967  34.020   11537966  836.1971
Afghanistan  Asia       1972  36.088   13079460  739.9811
Afghanistan  Asia       1977  38.438   14880372  786.1134

Basic statistical summaries of variables

summary(gapminder) 
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 

Ok we know our data!

But wait… one quick change: let’s convert country and continent from factors to plain character strings. Trust me…

gapminder <-
  gapminder %>% 
  mutate(country = as.character(country),
         continent = as.character(continent))
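Why bother? One possible reason (an assumption on my part, shown with a toy vector rather than the real data) is that factors remember all of their levels, even after you filter rows out. Character strings don't carry that baggage.

```r
# Toy factor, made up for illustration.
continent <- factor(c("Asia", "Europe", "Asia"))

# Keep only the Asia entries.
asia_only <- continent[continent == "Asia"]

# The factor still lists "Europe" as a level, even though no value uses it.
levels(asia_only)

# After converting to character, only the values that are present remain.
unique(as.character(asia_only))
```

Leftover levels like this can show up as empty categories in tables and plot legends.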

Some research questions…

Does GDP impact life expectancy? … others?

Does GDP impact life expectancy?

Distribution of variables

summary(gapminder$lifeExp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   23.60   48.20   60.71   59.47   70.85   82.60
summary(gapminder$gdpPercap)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    241.2   1202.1   3531.8   7215.3   9325.5 113523.1

Life expectancy

Plot the distribution of each variable on its own (univariate) first, and then the two together (bivariate).

gapminder %>% 
  ggplot(., aes(x = lifeExp)) +
  geom_histogram() +
  labs(
    title = "Distribution of life expectancy",
    subtitle = "All the data.. no filters",
    x = "Life expectancy",
    y = "Frequency",
    caption = "Data source: Gapminder from gapminder package"
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
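That message is ggplot2 nudging us to choose the bin size ourselves. A minimal sketch of how to do that: the binwidth of 5 years is an arbitrary choice for illustration, not a recommendation, and boundary = 0 just makes the bins start on multiples of 5.

```r
library(ggplot2)
library(gapminder)

p_binwidth <- ggplot(gapminder, aes(x = lifeExp)) +
  # Each bar now covers 5 years of life expectancy.
  geom_histogram(binwidth = 5, boundary = 0) +
  labs(
    x = "Life expectancy",
    y = "Frequency"
  )

p_binwidth
```

Try a few values; too wide hides the shape of the distribution, too narrow makes it noisy.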

OK, but maybe we want to filter on a year, or look across years, so that we see all the countries at one point in time?

Let’s look across time.

gapminder %>% 
  ggplot(., aes(x = lifeExp)) +
  geom_histogram() +
  labs(
    title = "Distribution of life expectancy",
    subtitle = "by year",
    x = "Life expectancy",
    y = "Frequency",
    caption = "Data source: Gapminder from gapminder package"
  ) +
  facet_wrap(~year)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
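Twelve small histograms can be hard to compare by eye, so a numeric summary by year can help alongside the plot. This is a sketch using dplyr's group_by() and summarise(); the choice of summary statistics and the column names are mine.

```r
library(dplyr)
library(gapminder)

lifeexp_by_year <- gapminder %>%
  group_by(year) %>%
  summarise(
    n_countries    = n(),              # countries observed in that year
    median_lifeExp = median(lifeExp),
    min_lifeExp    = min(lifeExp),
    max_lifeExp    = max(lifeExp)
  )

lifeexp_by_year
```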

Let’s look at one year.. say 2007

gapminder %>% 
  filter(year == 2007) %>% 
  ggplot(., aes(x = lifeExp)) +
  geom_histogram() +
  labs(
    title = "Distribution of life expectancy",
    subtitle = "in 2007",
    x = "Life expectancy",
    y = "Frequency",
    caption = "Data source: Gapminder from gapminder package"
  ) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

… I don’t like these graphs much yet, so let’s make them a bit prettier with themes: theme_bw() comes with ggplot2, and the ggthemes and hrbrthemes packages add more.

library(ggthemes)
library(hrbrthemes)
hrbrthemes::import_roboto_condensed()
## You will likely need to install these fonts on your system as well.
## 
## You can find them in [/Library/Frameworks/R.framework/Versions/4.0/Resources/library/hrbrthemes/fonts/roboto-condensed]
theme_set(theme_bw())
gapminder %>% 
  filter(year == 2007) %>% 
  ggplot(., aes(x = lifeExp)) +
  geom_histogram() +
  labs(
    title = "Distribution of life expectancy",
    subtitle = "in 2007",
    x = "Life expectancy",
    y = "Frequency",
    caption = "Data source: Gapminder from gapminder package"
  ) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Or…

theme_set(theme_ipsum())

gapminder %>% 
  filter(year == 2007) %>% 
  ggplot(., aes(x = lifeExp)) +
  geom_histogram() +
  labs(
    title = "Distribution of life expectancy",
    subtitle = "in 2007",
    x = "Life expectancy",
    y = "Frequency",
    caption = "Data source: Gapminder from gapminder package"
  ) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can also make these interactive…

We can also make these interactive with plotly or ggiraph

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
p <- gapminder %>% 
  filter(year == 2007) %>% 
  ggplot(., aes(x = lifeExp)) +
  geom_histogram(colour = "black", size = 0.25) +
  labs(
    title = "Distribution of life expectancy",
    subtitle = "in 2007",
    x = "Life expectancy",
    y = "Frequency",
    caption = "Data source: Gapminder from gapminder package"
  ) 

ggplotly(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

GDP

Plot the distribution of each variable on its own (univariate) first, and then the two together (bivariate).

gapminder %>% 
  ggplot(., aes(x = gdpPercap)) +
  geom_histogram(colour = "black", size = 0.25) +
  labs(
    title = "Distribution of GDP per capita",
    subtitle = "All the data.. no filters",
    x = "GDP per capita",
    y = "Frequency",
    caption = "Data source: Gapminder from gapminder package"
  ) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

OK, but maybe we want to filter on a year, or look across years, so that we see all the countries at one point in time?

Let’s look across time.

gapminder %>% 
  ggplot(., aes(x = gdpPercap)) +
  geom_histogram(colour = "black", size = 0.25) +
  labs(
    title = "Distribution of GDP per capita",
    subtitle = "by year",
    x = "GDP per capita",
    y = "Frequency",
    caption = "Data source: Gapminder from gapminder package"
  ) +
  facet_wrap(~year)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s look at one year.. say 2007

gapminder %>% 
  filter(year == 2007) %>% 
  ggplot(., aes(x = gdpPercap)) +
  geom_histogram(colour = "black", size = 0.25) +
  labs(
    title = "Distribution of GDP per capita",
    subtitle = "in 2007",
    x = "GDP per capita",
    y = "Frequency",
    caption = "Data source: Gapminder from gapminder package"
  ) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p <- gapminder %>% 
  filter(year == 2007) %>% 
  ggplot(., aes(x = gdpPercap)) +
  geom_histogram(colour = "black", size = 0.25) +
  labs(
    title = "Distribution of GDP per capita",
    subtitle = "in 2007",
    x = "GDP per capita",
    y = "Frequency",
    caption = "Data source: Gapminder from gapminder package"
  ) 

ggplotly(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Life expectancy by GDP

gapminder %>% 
  filter(year == 2007) %>% 
  ggplot(., aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  labs(
    title = "Life expectancy by GDP per capita",
    subtitle = "in 2007",
    x = "GDP per capita",
    y = "Life expectancy (years)",
    caption = "Data source: Gapminder from gapminder package"
  ) 

How about adding country as another data element?

library(viridisLite)
gapminder %>% 
  filter(year == 2007) %>% 
  ggplot(., aes(x = gdpPercap, y = lifeExp, colour = as.character(country))) +
  geom_point() +
  labs(
    title = "Life expectancy by GDP per capita",
    subtitle = "in 2007",
    x = "GDP per capita",
    y = "Life expectancy (years)",
    caption = "Data source: Gapminder from gapminder package"
  ) +
  scale_colour_viridis_d() 

OK, the legend doesn’t work well with the graph… too many countries! Let’s take out the legend for now.

gapminder %>% 
  filter(year == 2007) %>% 
  ggplot(., aes(x = gdpPercap, y = lifeExp, colour = country)) +
  geom_point() +
  labs(
    title = "Life expectancy by GDP per capita",
    subtitle = "in 2007",
    x = "GDP per capita",
    y = "Life expectancy (years)",
    caption = "Data source: Gapminder from gapminder package"
  ) +
  scale_colour_viridis_d() +
  theme(
    legend.position = "none"
  )

That’s better.

Interactive? Yes please. You’ll see we can actually add the legend back in with the interactivity.

p <- gapminder %>% 
  filter(year == 2007) %>% 
  ggplot(., aes(x = gdpPercap, y = lifeExp, colour = country)) +
  geom_point() +
  labs(
    title = "Life expectancy by GDP per capita",
    subtitle = "in 2007",
    x = "GDP per capita",
    y = "Life expectancy (years)",
    colour = "Country",
    caption = "Data source: Gapminder from gapminder package"
  ) +
  scale_colour_viridis_d() 

ggplotly(p)

Transformations

See here for transformations

The relationship doesn’t look linear… so let’s transform the data. Use scale_x_log10() to put GDP per capita on a log scale.

gapminder %>% 
  filter(year == 2007) %>% 
  ggplot(., aes(x = gdpPercap, y = lifeExp, colour = country)) +
  geom_point() +
  geom_smooth(aes(group = 1), lty = 2, colour = "grey80", se = FALSE) +
  labs(
    title = "Life expectancy by GDP per capita",
    subtitle = "in 2007",
    x = "GDP per capita (log scale)",
    y = "Life expectancy (years)",
    colour = "Country",
    caption = "Data source: Gapminder from gapminder package"
  ) +
  scale_colour_viridis_d() +
  theme_bw() +
  theme(
    legend.position = "none"
  ) +
  scale_x_log10()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
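A rough numeric check on whether the log scale helps: Pearson correlation measures linear association, so if the log transform straightens things out, the correlation with log10 GDP should come out larger than with raw GDP. This is only a sketch (gap_2007, cor_raw and cor_log are names made up here), and a correlation is no substitute for looking at the plot.

```r
library(gapminder)

# Same 2007 slice as in the plots.
gap_2007 <- subset(gapminder, year == 2007)

# Linear association with GDP per capita as-is...
cor_raw <- cor(gap_2007$gdpPercap, gap_2007$lifeExp)

# ...and after a log10 transform.
cor_log <- cor(log10(gap_2007$gdpPercap), gap_2007$lifeExp)

c(raw = cor_raw, log10 = cor_log)
```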

Or create a new column in your data using mutate()!

gapminder2 <- 
gapminder %>% 
  mutate(loggdp = log10(gdpPercap)) 

gapminder2 %>% 
  head()
## # A tibble: 6 x 7
##   country     continent  year lifeExp      pop gdpPercap loggdp
##   <chr>       <chr>     <int>   <dbl>    <int>     <dbl>  <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.   2.89
## 2 Afghanistan Asia       1957    30.3  9240934      821.   2.91
## 3 Afghanistan Asia       1962    32.0 10267083      853.   2.93
## 4 Afghanistan Asia       1967    34.0 11537966      836.   2.92
## 5 Afghanistan Asia       1972    36.1 13079460      740.   2.87
## 6 Afghanistan Asia       1977    38.4 14880372      786.   2.90

Plot that

p <- 
gapminder2 %>% 
  filter(year == 2007) %>%
  ggplot(., aes(x = loggdp, y = lifeExp, colour = country)) +
  geom_smooth(aes(group = 1), lty = 2, colour = "grey80", se = FALSE) +
  geom_point() +
  labs(
    title = "Life expectancy by log GDP per capita",
    subtitle = "in 2007",
    x = "Log GDP per capita",
    y = "Life expectancy (years)",
    colour = "Country",
    caption = "Data source: Gapminder from gapminder package"
  ) +
  scale_colour_viridis_d()

ggplotly(p)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Linear model fit

gapminder2 %>% 
  filter(year == 2007) %>%
  lm(data = ., lifeExp ~ gdpPercap)
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap, data = .)
## 
## Coefficients:
## (Intercept)    gdpPercap  
##   5.957e+01    6.371e-04
# don't forget to log transform
gapminder2 %>% 
  filter(year == 2007) %>%
  lm(data = ., lifeExp ~ log10(gdpPercap))
## 
## Call:
## lm(formula = lifeExp ~ log10(gdpPercap), data = .)
## 
## Coefficients:
##      (Intercept)  log10(gdpPercap)  
##             4.95             16.59
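How do we read that slope of about 16.59? Because GDP is on a log10 scale, one unit on the x axis is a tenfold change in GDP per capita. The sketch below checks this by predicting life expectancy at two made-up income levels that are ten times apart (1,000 and 10,000); the object names are mine. Keep in mind this describes an association in the 2007 data, not proof that GDP causes longer lives.

```r
library(gapminder)

gap_2007 <- subset(gapminder, year == 2007)
fit <- lm(lifeExp ~ log10(gdpPercap), data = gap_2007)

# Pull out the slope on log10(gdpPercap).
slope <- unname(coef(fit)["log10(gdpPercap)"])

# Predicted life expectancy at two hypothetical incomes, 10x apart.
pred <- predict(fit, newdata = data.frame(gdpPercap = c(1000, 10000)))

# The gap between the two predictions equals the slope.
pred_gap <- unname(diff(pred))

slope
pred_gap
```

summary(fit) gives the standard errors and R-squared if you want more than the coefficients.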

Linear regression line

gapminder2 %>% 
  filter(year == 2007) %>%
  ggplot(., aes(x = loggdp, y = lifeExp, colour = country)) +
  geom_smooth(method = "lm", aes(group = 1), lty = 2, colour = "grey80", se = FALSE) +
  geom_point() +
  labs(
    title = "Life expectancy by log GDP per capita",
    subtitle = "in 2007",
    x = "Log GDP per capita",
    y = "Life expectancy (years)",
    colour = "Country",
    caption = "Data source: Gapminder from gapminder package"
  ) +
  scale_colour_viridis_d() +
  theme(
    legend.position = "none"
  )
## `geom_smooth()` using formula 'y ~ x'